Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.
# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Mon Dec 11 10:27:22 2023"
The text continues here.
e.g., How are you feeling right now? What do you expect to learn? Where did you hear about the course?
As an R beginner, I feel this course is very friendly. Both the teacher's instructions and the book are clear enough for me to understand and move on easily. The learning curve looks smooth, and the instructions are also encouraging; all of this gives a beginner confidence. I have been impressed by how versatile and powerful R is, so I want to add it to my research toolbox, although my research does not currently involve it (it may in the future once I master the skill). In other words, learning R is an interest or hobby for me at the moment; it is always good to learn something fresh in my spare time, and R is the latest one! I heard about this course from a colleague who regularly makes impressive graphs from statistical data. He became very skillful at operating R and implementing many functions after a year of learning, starting from this very course. His feedback on the course and his learning outcomes really encouraged me to get engaged with R through this course.
How did it work as a “crash course” on modern R tools and using RStudio? Which were your favorite topics? Which topics were most difficult? Some other comments on the book and our new approach of getting started with R Markdown etc.?
The book is clear, instructive, and encouraging for an R beginner. I enjoyed the parts on data wrangling and visualization, which are easily accessible and readable. My favorite feature of R is the instant feedback after executing a command, whether right or wrong; this is what a researcher is eager for but seldom gets in an academic career, even after considerable effort. However, I found the data analysis challenging, since specialized knowledge of statistics is needed to interpret the R output. It can be easy to run the analysis functions but not so easy to understand the meaning of the results, so there is more to learn as I move on to data analysis in R. The book provides many clear instructions to work through and practice; I think it is a very good starting point for R. I just need to spend time on it.
Following the detailed instructions in “Start me up”, I set up the technical platforms needed in this course, including R, RStudio, and GitHub. Following the teacher's instructions, I learned how to push an updated script to GitHub. Then I started learning R with the book “R for Health Data Science” and the corresponding Exercise 1. This week I finished Chapters 1-5 of the book, which was a heavy and time-consuming task, but rewarding! In addition, I read the recommended Part 1 of the textbook “MABS4IODS” and watched the two webinars on RStudio and GitHub.
Describe the work you have done this week and summarize your learning.
date()
## [1] "Mon Dec 11 10:27:22 2023"
Data analysis assignment
Read the students2014 data into R and explore the structure and the dimensions of the data
students2014 <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/learning2014.txt", sep=",", header=TRUE)
dim(students2014)
## [1] 166 7
str(students2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
Description: The data set consists of 166 observations of 7 variables. It records each student's gender, age, attitude towards statistics, and exam points, together with their responses to questions on deep, surface, and strategic learning.
Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them.
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
# For making a nicer plot
my_lower <- function(data, mapping, ...) {
  ggplot(data = data, mapping = mapping) +
    geom_point(..., alpha = 1, pch = 21, color = "black") +
    theme(
      panel.background = element_rect(fill = "white", color = "black")
    )
}
my_diag <- function(data, mapping, ...) {
  ggplot(data = data, mapping = mapping) +
    geom_density(..., alpha = 0.7, color = "black") +
    theme(
      panel.background = element_rect(fill = "white", color = "black")
    )
}
my_discrete <- function(data, mapping, ...) {
  ggplot(data = data, mapping = mapping) +
    geom_bar(..., alpha = 1, color = "black") +
    theme(
      panel.background = element_rect(fill = "white", color = "black")
    )
}
my_upper_combo <- function(data, mapping, ...) {
  ggplot(data = data, mapping = mapping) +
    geom_boxplot(..., alpha = 1, color = "black") +
    geom_jitter(..., alpha = 0.5, width = 0.25, size = 1, pch = 21, color = "black") +
    theme(
      panel.background = element_rect(fill = "white", color = "black")
    )
}
# Plot
p <- ggpairs(students2014,
             mapping = aes(col = gender, fill = gender, alpha = 0.7),
             lower = list(
               continuous = my_lower,
               combo = wrap("facethist", bins = 20, color = "black")
             ),
             diag = list(
               continuous = my_diag,
               discrete = my_discrete
             ),
             upper = list(
               combo = my_upper_combo,
               continuous = wrap("cor", color = "black")
             )) +
  scale_fill_manual(values = c("lightblue", "wheat")) +
  scale_color_manual(values = c("lightblue", "wheat"))
# Print the plot
p
# Summaries of the variables in the data
summary(students2014)
## gender age attitude deep
## Length:166 Min. :17.00 Min. :1.400 Min. :1.583
## Class :character 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333
## Mode :character Median :22.00 Median :3.200 Median :3.667
## Mean :25.51 Mean :3.143 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083
## Max. :55.00 Max. :5.000 Max. :4.917
## stra surf points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
Description: Most of the numeric variables are roughly normally distributed, except for age, which is concentrated around the early twenties. Females are about twice as frequent as males. Among the variables, attitude towards statistics is most strongly correlated with exam points (r = 0.437).
Choose three variables as explanatory variables and fit a regression model where exam points is the target (dependent, outcome) variable. Show a summary of the fitted model and comment and interpret the results. Explain and interpret the statistical test related to the model parameters. If an explanatory variable in your model does not have a statistically significant relationship with the target variable, remove the variable from the model and fit the model again without it.
# Choose attitude, stra, and surf as the three explanatory variables because they have the highest (absolute) correlations with the target variable, exam points, according to the plot matrix.
model1 <- lm(points ~ attitude + stra + surf, data = students2014)
# Print out the summary of the first model
summary(model1)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
# Remove the non-significant variables and fit a new regression model
model2 <- lm(points ~ attitude, data = students2014)
# Print out the summary of the fitted model
summary(model2)
##
## Call:
## lm(formula = points ~ attitude, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
Description: “Attitude” was the only significant predictor in the model. The other two variables entered into the model, “stra” and “surf”, had p-values of 0.12 and 0.47, respectively. Therefore, the model was re-fitted with “attitude” alone, producing a final model that explains 19.06% (multiple R-squared = 0.1906) of the variance in the response (exam points).
Using a summary of your fitted model, explain the relationship between the chosen explanatory variables and the target variable (interpret the model parameters). Explain and interpret the multiple R-squared of the model.
summary(model2)
##
## Call:
## lm(formula = points ~ attitude, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
Description: The resulting model shows that attitude is a meaningful predictor of exam points. Specifically:
- a. Each one-unit increase in attitude is associated with an increase of about 3.5 exam points (slope = 3.5255).
- b. The intercept (11.637) is the predicted exam score when attitude equals zero; in other words, a student would be expected to score 11.637 points if attitude had no influence at all.
- c. The model explains only a fairly small proportion (about 19%) of the variance in exam points, so attitude alone is not a sufficient predictor, even though its influence should be recognized. Random error and other factors also play a role in the exam points.
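As a quick numeric illustration of (a) and (b), the fitted line can be evaluated by hand. A minimal sketch; `predict_points` is a made-up helper using the coefficient estimates printed in the model summary above:

```r
# Predicted exam points from the fitted line: points = 11.6372 + 3.5255 * attitude
# (coefficients copied from the summary of model2 above)
predict_points <- function(attitude) 11.6372 + 3.5255 * attitude
predict_points(0)  # the intercept: expected points when attitude is zero
predict_points(3)  # expected points for an attitude score of 3 (about 22.2)
```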
Produce the following diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage. Explain the assumptions of the model and interpret the validity of those assumptions based on the diagnostic plots.
par(mfrow = c(2,2))
plot(model2, which = c(1,2,5))
Description:
a. The Residuals vs Fitted plot shows that the points scatter randomly around the dotted line at y = 0, and the fitted (red) line is roughly horizontal with no distinct pattern or trend, indicating a linear relationship. The linearity assumption of linear regression is therefore supported.
b. The QQ plot shows that most points lie on the dashed straight line, except at the lower and upper tails, where some points deviate from it, suggesting the residual distribution is slightly skewed. Nevertheless, the normality assumption can be considered approximately met, since with real data it is almost never perfectly satisfied.
c. The Residuals vs Leverage plot shows no observations with a large Cook's distance. If such outlying observations did appear, removing them would be worth considering, because they would have a high influence on the fitted model.
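The visual check in (c) can also be done numerically. A minimal sketch on synthetic data (the 4/n cutoff is a common rule of thumb, not something from the course material):

```r
# Numeric check of influential observations via Cook's distance
# (synthetic data for illustration only)
set.seed(42)
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)
fit <- lm(y ~ x)
cd <- cooks.distance(fit)          # one Cook's distance per observation
# A common rule of thumb flags observations with Cook's distance > 4/n
influential <- which(cd > 4 / length(cd))
```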
date()
## [1] "Mon Dec 11 10:27:25 2023"
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
# read the data
alc <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv", sep=",", header=TRUE)
# print out the names of the variables
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
dim(alc)
## [1] 370 35
glimpse(alc)
## Rows: 370
## Columns: 35
## $ school <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP",…
## $ sex <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "F",…
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15,…
## $ address <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U",…
## $ famsize <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", "LE…
## $ Pstatus <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "T",…
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4,…
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3,…
## $ Mjob <chr> "at_home", "at_home", "at_home", "health", "other", "servic…
## $ Fjob <chr> "teacher", "other", "other", "services", "other", "other", …
## $ reason <chr> "course", "course", "other", "home", "home", "reputation", …
## $ guardian <chr> "mother", "father", "mother", "mother", "father", "mother",…
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1,…
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1,…
## $ schoolsup <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no", "n…
## $ famsup <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "yes",…
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", "ye…
## $ nursery <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "yes…
## $ higher <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye…
## $ internet <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "yes",…
## $ romantic <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "no"…
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3,…
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1,…
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3,…
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3,…
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5,…
## $ failures <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,…
## $ paid <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes", …
## $ absences <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3, 9, 5,…
## $ G1 <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11, 14, 16,…
## $ G2 <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11, 15, 16…
## $ G3 <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12, 16, 1…
## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5, 1.0,…
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
This data set contains information on the background, achievement, and alcohol consumption of 370 students in two Portuguese schools. There are 35 variables, covering student grades, demographic, social, and school-related features, as well as performance in Mathematics (mat) and Portuguese language (por). The data were collected using school reports and questionnaires.
The purpose of your analysis is to study the relationships between high/low alcohol consumption and some of the other variables in the data. To do this, choose 4 interesting variables in the data and for each of them, present your personal hypothesis about their relationships with alcohol consumption.
Based on my everyday observations and some research reports on teenagers' alcohol consumption, I chose family relationships (“famrel”), number of school absences (“absences”), weekly study time (“studytime”), and frequency of going out with friends (“goout”) as candidate predictors of alcohol consumption. Hypothesis: a student belongs to the high alcohol consumption group if he or she (1) has a low-quality family relationship; (2) is absent from school more often; (3) spends less time studying weekly; and (4) goes out with friends more often.
Numerically and graphically explore the distributions of your chosen variables and their relationships with alcohol consumption (use for example cross-tabulations, bar plots and box plots). Comment on your findings and compare the results of your exploration to your previously stated hypotheses.
library(ggplot2)
library(tidyr)
alc |>
select(absences, studytime, famrel, goout) |>
pivot_longer(everything(),
names_to = "variable",
values_to = "value") |>
ggplot(aes(x = value))+
facet_wrap(~variable, scales = "free")+
geom_bar()+
labs(title = "Distribution of the interested variables",
x = "Values of each variable",
y = "Frequency")
# The relationship between students' absences of class and alcohol consumption (box plot for numerical~categorical variables)
p1 <- alc |>
ggplot(aes(x = high_use, y = absences)) +
geom_boxplot() +
geom_jitter(width=0.25, alpha=0.5)+
labs(x = "Alcohol consumption",
y = "Frequency of class absences",
title =
"Frequency of class absences and alcohol consumption")+
scale_x_discrete(labels = c("FALSE" = "Non-high-user",
"TRUE" = "high-user"))
p1
# The relationship between quality of family relation and alcohol consumption (bar plot for categorical variables)
p2 <- alc |>
ggplot(aes(x = factor(famrel), fill = high_use)) +
geom_bar(position = "fill", color = "black") +
labs(x = "Quality of family relationships",
y = "Proportion of alcohol high-users",
title =
"Quality of family relationships and alcohol consumption")+
guides(fill=guide_legend("Alcohol consumption"))+
scale_fill_discrete(labels = c("FALSE" = "Non-high-user",
"TRUE" = "high-user"))
p2
# The relationship between going out with friends and alcohol consumption (bar plot for categorical variables)
p3 <- alc |>
ggplot(aes(x = factor(goout), fill = high_use)) +
geom_bar(position = "fill", color = "black") +
labs(x = "Going out with friends",
y = "Proportion of alcohol high-users",
title =
"Going out with friends and alcohol consumption")+
guides(fill=guide_legend("Alcohol consumption"))+
scale_fill_discrete(labels = c("FALSE" = "Non-high-user",
"TRUE" = "high-user"))
p3
# The relationship between weekly study time and alcohol consumption (bar plot for categorical variables)
p4 <- alc |>
ggplot(aes(x = factor(studytime), fill = high_use)) +
geom_bar(position = "fill", color = "black") +
labs(x = "Weekly study time",
y = "Proportion of alcohol high-users",
title =
"Weekly study time and alcohol consumption")+
guides(fill=guide_legend("Alcohol consumption"))+
scale_fill_discrete(labels = c("FALSE" = "Non-high-user",
"TRUE" = "high-user"))
p4
# combining four plots together
library(patchwork)
p1 + p2 + p3 + p4
Four plots were made to explore the relationships between the four variables and alcohol consumption. (1) The box plot shows that the frequency and median (Q1, Q3) of class absences differ greatly between alcohol high-users and non-high-users: students with high alcohol use have more class absences, as hypothesized. (2) The first bar plot shows a result that differs from the hypothesis regarding the quality of family relationships. Students with the worst family relationships are not the group with the highest consumption; rather, the highest consumers are those with bad-to-middling family relationships. (3) The second bar plot shows that the more often students go out with their friends, the higher the proportion of alcohol high-users, as hypothesized. (4) The third bar plot shows that the more time students spend studying each week, the lower the proportion of alcohol high-users, as hypothesized.
Use logistic regression to statistically explore the relationship between your chosen variables and the binary high/low alcohol consumption variable as the target variable. Present and interpret a summary of the fitted model. Present and interpret the coefficients of the model as odds ratios and provide confidence intervals for them. Interpret the results and compare them to your previously stated hypothesis. Hint: If your model includes factor variables, see for example the RHDS book or the first answer of this stack exchange thread on how R treats and how you should interpret these variables in the model output (or use some other resource to study this).
# Fit the model based on the original hypothesis
model <- glm(high_use ~ absences + famrel + goout + studytime, data = alc, family = "binomial")
summary(model)
##
## Call:
## glm(formula = high_use ~ absences + famrel + goout + studytime,
## family = "binomial", data = alc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.15221 0.71444 -1.613 0.106798
## absences 0.06655 0.02192 3.036 0.002394 **
## famrel -0.35898 0.13796 -2.602 0.009266 **
## goout 0.75990 0.12143 6.258 3.91e-10 ***
## studytime -0.57194 0.16963 -3.372 0.000747 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 370.36 on 365 degrees of freedom
## AIC: 380.36
##
## Number of Fisher Scoring iterations: 4
# Model coefficients (log-odds) and their profile-likelihood confidence intervals.
# Note: these values are on the log-odds scale; exp() converts them to odds ratios.
OR <- coef(model)
CI <- confint(model)
## Waiting for profiling to be done...
ORCI <- cbind(OR, CI)
print(ORCI, digits = 1)
## OR 2.5 % 97.5 %
## (Intercept) -1.15 -2.57 0.24
## absences 0.07 0.02 0.11
## famrel -0.36 -0.63 -0.09
## goout 0.76 0.53 1.01
## studytime -0.57 -0.92 -0.25
All of the hypothesized predictors are statistically significant in the model (p < 0.01), indicating that all four help predict alcohol consumption. Note that the table above is on the log-odds scale, so the coefficients must be exponentiated to obtain odds ratios. (1) Absences: each additional class absence multiplies the odds of being an alcohol high-user by exp(0.067) ≈ 1.07 (95% CI ≈ 1.02-1.12). (2) Quality of family relationships: each one-unit increase in family relationship quality multiplies the odds by exp(-0.36) ≈ 0.70, i.e. about 30% lower odds (95% CI ≈ 0.53-0.91). (3) Going out with friends: each additional level of going out multiplies the odds by exp(0.76) ≈ 2.14 (95% CI ≈ 1.70-2.75). (4) Weekly study time: each additional level of weekly study time multiplies the odds by exp(-0.57) ≈ 0.56, i.e. about 44% lower odds (95% CI ≈ 0.40-0.78). These findings are consistent with the stated hypotheses.
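Since the printed table is on the log-odds scale, a single `exp()` call yields the odds ratios; a minimal sketch using the coefficient estimates copied from the model summary above:

```r
# Convert the log-odds coefficients printed above to odds ratios
log_odds <- c(absences = 0.06655, famrel = -0.35898,
              goout = 0.75990, studytime = -0.57194)
odds_ratios <- exp(log_odds)
round(odds_ratios, 2)  # absences 1.07, famrel 0.70, goout 2.14, studytime 0.56
```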
Using the variables which, according to your logistic regression model, had a statistical relationship with high/low alcohol consumption, explore the predictive power of your model. Provide a 2x2 cross tabulation of predictions versus the actual values and optionally display a graphic visualizing both the actual values and the predictions. Compute the total proportion of inaccurately classified individuals (= the training error) and comment on all the results. Compare the performance of the model with performance achieved by some simple guessing strategy.
# Explore the predictive power of the model
probabilities <- predict(model, type = "response")
# Add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# Use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
# See the last ten original classes, predicted probabilities, and class predictions
select(alc, absences, famrel, goout, studytime, high_use, probability, prediction) %>% tail(10)
## absences famrel goout studytime high_use probability prediction
## 361 3 4 3 2 FALSE 0.22224385 FALSE
## 362 0 4 2 1 FALSE 0.16242879 FALSE
## 363 7 5 3 1 TRUE 0.31573116 FALSE
## 364 1 5 3 3 FALSE 0.08975198 FALSE
## 365 6 4 3 1 FALSE 0.38200795 FALSE
## 366 2 5 2 3 FALSE 0.04697548 FALSE
## 367 2 4 4 2 FALSE 0.36371174 FALSE
## 368 3 1 1 2 FALSE 0.15505370 FALSE
## 369 4 2 5 1 TRUE 0.83529411 TRUE
## 370 2 4 1 1 TRUE 0.09388808 FALSE
# Tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)%>%
addmargins
## prediction
## high_use FALSE TRUE Sum
## FALSE 233 26 259
## TRUE 63 48 111
## Sum 296 74 370
# Display a graphic visualizing actual high_use and predictions
p5 <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))+
geom_point()
p5
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)%>%
prop.table()%>%
addmargins()%>%
print(digits = 2)
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.63 0.07 0.70
## TRUE 0.17 0.13 0.30
## Sum 0.80 0.20 1.00
Among the 259 participants who are not alcohol high-users, the model correctly classifies 233 (90%). Among the 111 participants who are alcohol high-users, it correctly classifies only 48 (43%), misclassifying the other 63. In all, 89 of the 370 predictions (24%) were inaccurate.
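The percentages above can be recomputed directly from the 2x2 cross-tabulation; a small sketch:

```r
# Specificity, sensitivity, and training error from the cross-tabulation above
conf <- matrix(c(233, 63, 26, 48), nrow = 2,
               dimnames = list(high_use = c("FALSE", "TRUE"),
                               prediction = c("FALSE", "TRUE")))
specificity <- conf["FALSE", "FALSE"] / sum(conf["FALSE", ])  # 233/259, about 0.90
sensitivity <- conf["TRUE", "TRUE"] / sum(conf["TRUE", ])     # 48/111, about 0.43
training_error <- (conf["FALSE", "TRUE"] + conf["TRUE", "FALSE"]) / sum(conf)
training_error  # (26 + 63) / 370, about 0.24
```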
Bonus: Perform 10-fold cross-validation on your model. Does your model have better test set performance (smaller prediction error using 10-fold cross-validation) compared to the model introduced in the Exercise Set (which had about 0.26 error). Could you find such a model?
# Define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
# Call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2405405
# K-fold cross-validation
library(boot)
set.seed(1)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = model, K = 10)
# average number of wrong predictions in the cross validation
delta.percent <- paste0((cv$delta[1]|> round(4))*100, "%")
The training error rate is about 24% (0.2405). According to the 10-fold cross-validation, the model has an average test error of 23.51%, slightly lower than the training error and better than the model in Exercise Set 3, which had about 0.26 error.
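For the comparison with a simple guessing strategy requested in the previous task, a natural baseline is to always predict the majority class; a sketch using the counts from the cross-tabulation above:

```r
# Majority-class baseline: always predict "not a high user".
# This guess is wrong for every one of the 111 high-users out of 370 students.
baseline_error <- 111 / 370
baseline_error  # 0.3, clearly worse than the model's ~0.24 training error
```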
Perform cross-validation to compare the performance of different logistic regression models (= different sets of predictors). Start with a very high number of predictors and explore the changes in the training and testing errors as you move to models with less predictors. Draw a graph displaying the trends of both training and testing errors by the number of predictors in the model.
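One way to approach this task is to fit a sequence of nested models and cross-validate each, recording both error types. A sketch on synthetic data (the variables `x1`-`x4` and the data-generating step are illustrative stand-ins, not the `alc` data):

```r
# Compare models with a decreasing/increasing number of predictors via 10-fold CV
library(boot)
set.seed(1)
n <- 200
df <- data.frame(y  = rbinom(n, 1, 0.3),
                 x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)
predictors <- c("x1", "x2", "x3", "x4")
errors <- data.frame(k = seq_along(predictors), train = NA_real_, test = NA_real_)
for (k in seq_along(predictors)) {
  f   <- reformulate(predictors[1:k], response = "y")
  fit <- glm(f, data = df, family = "binomial")
  errors$train[k] <- loss_func(df$y, fitted(fit))                       # training error
  errors$test[k]  <- cv.glm(df, cost = loss_func, glmfit = fit, K = 10)$delta[1]  # CV error
}
# Plot both error curves against the number of predictors
matplot(errors$k, errors[, c("train", "test")], type = "b", pch = 1:2,
        xlab = "Number of predictors", ylab = "Error rate")
legend("topright", legend = c("train", "test"), pch = 1:2, col = 1:2)
```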
date()
## [1] "Mon Dec 11 10:27:27 2023"
Load the Boston data from the MASS package. Explore the structure and the dimensions of the data and describe the dataset briefly, assuming the reader has no previous knowledge of it.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:patchwork':
##
## area
## The following object is masked from 'package:dplyr':
##
## select
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
The data set describes housing values in the suburbs of Boston. There are 506 observations of 14 variables (numeric and integer). The 14 variables are:

1. crim: per capita crime rate by town.
2. zn: proportion of residential land zoned for lots over 25,000 sq.ft.
3. indus: proportion of non-retail business acres per town.
4. chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
5. nox: nitrogen oxides concentration (parts per 10 million).
6. rm: average number of rooms per dwelling.
7. age: proportion of owner-occupied units built prior to 1940.
8. dis: weighted mean of distances to five Boston employment centres.
9. rad: index of accessibility to radial highways.
10. tax: full-value property-tax rate per $10,000.
11. ptratio: pupil-teacher ratio by town.
12. black: 1000(Bk − 0.63)^2, where Bk is the proportion of blacks by town.
13. lstat: lower status of the population (percent).
14. medv: median value of owner-occupied homes in $1000s.
Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them.
# Show summaries of the variables
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
# First make a distribution plot for 14 variables
library(ggplot2)
library(tidyr)
# Convert the data to long format for easier plotting
long_boston <- gather(Boston)
# Plotting
p1 <- ggplot(long_boston, aes(x = value)) +
geom_density(fill = "skyblue", color = "black") +
facet_wrap(~key, scales = "free") +
theme_minimal() +
labs(title = "Overview of Boston dataset")
# Print the plot
p1
# Then make a correlations plot to look at the correlations among the 14 variables in the data
library(corrplot)
## corrplot 0.92 loaded
# Calculate the correlation matrix and round it
cor_matrix <- cor(Boston) |>
round(digits = 2)
# Print the correlation matrix
print(cor_matrix)
## crim zn indus chas nox rm age dis rad tax ptratio
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44 -0.18
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51
## black lstat medv
## crim -0.39 0.46 -0.39
## zn 0.18 -0.41 0.36
## indus -0.36 0.60 -0.48
## chas 0.05 -0.05 0.18
## nox -0.38 0.59 -0.43
## rm 0.13 -0.61 0.70
## age -0.27 0.60 -0.38
## dis 0.29 -0.50 0.25
## rad -0.44 0.49 -0.38
## tax -0.44 0.54 -0.47
## ptratio -0.18 0.37 -0.51
## black 1.00 -0.37 0.33
## lstat -0.37 1.00 -0.74
## medv 0.33 -0.74 1.00
# Visualize the correlation matrix
corrplot(cor_matrix, method="circle", type = "upper")
From the distribution plots, most of the variables are skewed to the right or to the left; only rm is relatively normally distributed. From the correlation matrix, the strongest correlations are between rad and tax (positive), dis and age (negative), dis and nox (negative), dis and indus (negative), and lstat and medv (negative).
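The same strongest pairs can be pulled out of the matrix programmatically instead of by eye. A minimal sketch, assuming the Boston data from MASS as above:

```r
library(MASS)   # provides the Boston dataset

# Rank the pairwise correlations by absolute value
cor_matrix <- cor(Boston)
cor_matrix[upper.tri(cor_matrix, diag = TRUE)] <- NA   # keep each pair only once
cor_long <- subset(as.data.frame(as.table(cor_matrix)), !is.na(Freq))
head(cor_long[order(-abs(cor_long$Freq)), ], 5)        # strongest pairs, e.g. rad-tax
```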
Standardize the dataset and print out summaries of the scaled data. How did the variables change? Create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate). Use the quantiles as the break points in the categorical variable. Drop the old crime rate variable from the dataset. Divide the dataset to train and test sets, so that 80% of the data belongs to the train set.
# Center and standardize variables
boston_scaled <- scale(Boston) |>
as.data.frame()
# Summaries of the scaled variables
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
After scaling the data, the mean of every variable becomes zero. Most of the values now range between -4 and 4, with crim (crime rate) as the main exception.
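As a sanity check on what scale() does — each column is centered by its mean and divided by its standard deviation — here is a minimal sketch on one variable, again assuming the Boston data from MASS:

```r
library(MASS)

x <- Boston$crim
manual <- (x - mean(x)) / sd(x)          # standardize by hand
identical_enough <- all.equal(as.numeric(scale(x)), manual)
identical_enough                          # TRUE: scale() is (x - mean) / sd, column by column
```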
# Summary of the scaled crime rate
summary(boston_scaled$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
# Create a quantile vector of crim
bins <- quantile(boston_scaled$crim)
# Create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,
labels = c("low", "med_low","med_high", "high"))
# Look at the table of the new factor crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# Remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# Add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
library(dplyr)
# Number of rows in the Boston dataset
n <- nrow(boston_scaled)
# Choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# Create train set
train <- boston_scaled[ind,]
# Create test set
test <- boston_scaled[-ind,]
Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. Draw the LDA (bi)plot.
# Fit the linear discriminant analysis
lda.fit <- lda(crime~., data = train)
# Print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2549505 0.2450495 0.2400990 0.2599010
##
## Group means:
## zn indus chas nox rm age
## low 0.94768689 -0.9154164 -0.15765625 -0.8664804 0.49324129 -0.8330934
## med_low -0.06128399 -0.2905159 -0.03371693 -0.5543909 -0.09719511 -0.3794097
## med_high -0.37054365 0.2063740 0.05238023 0.3768264 0.15156095 0.3976410
## high -0.48724019 1.0149946 -0.04735191 1.0267822 -0.39946717 0.8122921
## dis rad tax ptratio black lstat
## low 0.8750206 -0.6897369 -0.7780503 -0.4386754 0.38100748 -0.767876258
## med_low 0.3826266 -0.5456857 -0.4654718 -0.1017024 0.31146285 -0.162813217
## med_high -0.4097682 -0.3756701 -0.2603363 -0.2604135 0.09270391 0.006875968
## high -0.8535887 1.6596029 1.5294129 0.8057784 -0.76867030 0.834420910
## medv
## low 0.5582392
## med_low 0.0121384
## med_high 0.2007779
## high -0.6528397
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.11885002 0.66122613 -0.98747678
## indus -0.04130912 -0.23640126 0.23839579
## chas -0.07358997 -0.01543395 0.21302462
## nox 0.33296411 -0.76745164 -1.35199275
## rm -0.08676976 -0.07017084 -0.15558456
## age 0.29552614 -0.15968266 -0.24988799
## dis -0.11193174 -0.11699745 0.17008638
## rad 3.12327648 1.09445683 -0.22148103
## tax 0.10112350 -0.22764844 0.78165323
## ptratio 0.14306963 0.10445909 -0.33126409
## black -0.15592597 -0.01390650 0.09298486
## lstat 0.17766584 -0.24785436 0.32723953
## medv 0.17126083 -0.36432554 -0.26921116
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9537 0.0335 0.0128
# Draw the LDA (bi)plot
# The function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
graphics::arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# Target classes as numeric
classes <- as.numeric(train$crime)
# Plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
A biplot based on LD1 and LD2 was generated. In the LDA biplot, the low, med_low, and med_high classes separate poorly, with heavy overlap between these three clusters. The high class separates well, although it still overlaps noticeably with med_high. Based on the arrows, the variable rad contributes the most to separating the high class. The contributions of the variables to the other clusters are hard to judge because of the heavy overlap.
Save the crime categories from the test set and then remove the categorical crime variable from the test dataset. Then predict the classes with the LDA model on the test data. Cross tabulate the results with the crime categories from the test set. Comment on the results.
# Save the correct classes from test data
correct_classes <- test$crime
# Remove the crime variable from test data
test <- dplyr::select(test, -crime)
# Predict classes with the LDA model on the test data
lda.pred <- predict(lda.fit, newdata = test)
# Cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 16 8 0 0
## med_low 5 16 6 0
## med_high 0 15 14 0
## high 0 0 1 21
The cross-tabulated results show that the high class is predicted very accurately and the low and med_low classes are mostly correct, but fewer than half of the med_high observations are classified correctly, with most misclassifications falling into med_low. Overall, the predictive performance of the linear discriminant analysis is only moderately satisfactory.
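The overall accuracy can be read off such a table as the share of the diagonal. A small sketch using the counts printed above (these vary from run to run, since the train/test split is random):

```r
# Confusion matrix copied from the cross-tabulation above
conf <- matrix(c(16,  8,  0,  0,
                  5, 16,  6,  0,
                  0, 15, 14,  0,
                  0,  0,  1, 21),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))
accuracy <- sum(diag(conf)) / sum(conf)   # correct predictions / all predictions
round(accuracy, 2)                        # 0.66
```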
Reload the Boston dataset and standardize the dataset (we did not do this in the Exercise Set, but you should scale the variables to get comparable distances). Calculate the distances between the observations. Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results.
data("Boston")
boston_scaled <- scale(Boston) |>
as.data.frame()
# Euclidean distances matrix
dist_eu <- dist(boston_scaled)
# Look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
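By default dist() returns Euclidean distances, i.e. sqrt(sum((x_i - x_j)^2)) for each pair of rows. A toy check:

```r
# Two points at (0, 0) and (3, 4): a 3-4-5 right triangle, so the distance is 5
m <- rbind(c(0, 0), c(3, 4))
d <- as.numeric(dist(m))
d  # 5
```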
set.seed(123)
k_max <- 10
# Calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
# Visualize the results
ggplot(data.frame(k = 1:k_max, twcss = twcss), aes(x = k, y = twcss)) +
  geom_line() # qplot() was deprecated in ggplot2 3.4.0
The optimal number of clusters is where the total WCSS drops most sharply. In this case, two clusters seem optimal.
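This "elbow" heuristic can be made explicit by looking at how much the total WCSS drops each time a cluster is added. The sketch below mirrors the chunk above (assuming the Boston data from MASS):

```r
library(MASS)

boston_scaled <- as.data.frame(scale(Boston))
set.seed(123)
twcss <- sapply(1:10, function(k) kmeans(boston_scaled, k)$tot.withinss)
drops <- -diff(twcss)   # decrease in total WCSS per added cluster
which.max(drops) + 1    # k with the largest single drop in WCSS
```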
# K-means clustering with 2 clusters
km <- kmeans(boston_scaled, centers = 2)
# Plot the scaled Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)
pairs(boston_scaled[1:7], col = km$cluster)
pairs(boston_scaled[8:14], col = km$cluster)
Most of the variables act as informative linear separators for the clusters, with the exception of rad, ptratio, and tax.
Run the code below for the (scaled) train data that you used to fit the LDA. The code creates a matrix product, which is a projection of the data points. Adjust the code: add argument color as a argument in the plot_ly() function. Set the color to be the crime classes of the train set. Draw another 3D plot where the color is defined by the clusters of the k-means. How do the plots differ? Are there any similarities?
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
# Install and access the plotly package. Create a 3D plot (cool!) of the columns of the matrix product using the code below.
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
p3 <- plot_ly(x = matrix_product$LD1,
y = matrix_product$LD2,
z = matrix_product$LD3,
color = train$crime, #Set the color to be the crime classes of the train set.
type= 'scatter3d',
mode='markers',
size = 2)
p3
# Draw another 3D plot where the color is defined by the clusters of the k-means
# Get the clusters of k-means for the train set
train.km <- kmeans(model_predictors, centers = 2)
p4 <- plot_ly(x = matrix_product$LD1,
y = matrix_product$LD2,
z = matrix_product$LD3,
type= 'scatter3d',
mode='markers',
color = factor(train.km$cluster), #color defined by clusters of the k-means
size = 2)
p4
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
The LDA was trained on a categorical crime-rate variable with four classes, defined by the quantiles of the scaled crime rate, whereas k = 2 was chosen for the k-means clustering based on the total within-cluster sum of squares. Since LDA is a supervised technique, we know what each category represents, and the categories are labeled in the plot. K-means clustering is an unsupervised method, so the two clusters it identifies have no predefined real-world meaning until we inspect them closely. Comparing the two 3D plots, it is interesting to see that cluster 2 from k-means overlaps nicely with the high category from the LDA, while cluster 1 roughly covers the other three categories.
date()
## [1] "Mon Dec 11 10:27:31 2023"
Move the country names to rownames (see Exercise 5.5). Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ plotly::filter() masks dplyr::filter(), stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ plotly::select() masks MASS::select(), dplyr::select()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
human <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/human2.csv")
## Rows: 155 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (8): Edu2.FM, Labo.FM, Life.Exp, Edu.Exp, GNI, Mat.Mor, Ado.Birth, Parli.F
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
human_1 <- column_to_rownames(human, "Country")
summary(human_1)
## Edu2.FM Labo.FM Life.Exp Edu.Exp
## Min. :0.1717 Min. :0.1857 Min. :49.00 Min. : 5.40
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:66.30 1st Qu.:11.25
## Median :0.9375 Median :0.7535 Median :74.20 Median :13.50
## Mean :0.8529 Mean :0.7074 Mean :71.65 Mean :13.18
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:77.25 3rd Qu.:15.20
## Max. :1.4967 Max. :1.0380 Max. :83.50 Max. :20.20
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
library(GGally)
# For making a nicer plot, create some helper functions
# Define a function that gives more control over ggpairs. This function produces a scatter plot with a fitted line.
my.fun.smooth <- function(data, # the function needs 3 arguments
                          mapping,
                          method = "lm"){
  ggplot(data = data,          # data is passed from ggpairs' arguments
         mapping = mapping) +  # aes is passed from ggpairs' arguments
    geom_point(size = 0.3,     # draw points
               color = "black") +
    geom_smooth(method = method,  # fit a linear regression
                linewidth = 0.3,  # `size` for lines is deprecated since ggplot2 3.4.0
                color = "red") +
    theme(panel.grid.major = element_blank(), # get rid of the grids
          panel.grid.minor = element_blank(),
          panel.background = element_rect(fill = "wheat", # adjust background
                                          color = "black"))
}
# Define a function that gives more control over ggpairs. This function produces a density plot.
my.fun.density <- function(data, mapping, ...) { # comments roughly the same as above
  ggplot(data = data, mapping = mapping) +
    geom_histogram(aes(y = after_stat(density)), # `..density..` is deprecated since ggplot2 3.4.0
                   color = "black",
                   fill = "white") +
    geom_density(fill = "#FF6666", alpha = 0.25) +
    theme(panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.background = element_rect(fill = "lightblue",
                                          color = "black"))
}
ggpairs(human_1, #data
progress = FALSE,
lower =
list(continuous = my.fun.smooth), # lower half show points with fitted line
diag =
list(continuous = my.fun.density), # diagonal grids show density plots
title = "Relationships between variables") + # title
theme (plot.title = element_text(size = 22, # adjust title visuals
face = "bold"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `geom_smooth()` using formula = 'y ~ x'
Only expected years of education (Edu.Exp) is approximately normally distributed; the rest of the variables are skewed in one direction or the other. Most of the variables correlate strongly with the others, except Parli.F and Labo.FM.
Perform principal component analysis (PCA) on the raw (non-standardized) human data. Show the variability captured by the principal components. Draw a biplot displaying the observations by the first two principal components (PC1 coordinate in x-axis, PC2 coordinate in y-axis), along with arrows representing the original variables.
# Perform principal component analysis on the raw (non-standardized) human data
pca_human <- prcomp(human_1)
print(pca_human)
## Standard deviations (1, .., p=8):
## [1] 1.854416e+04 1.855219e+02 2.518701e+01 1.145441e+01 3.766241e+00
## [6] 1.565912e+00 1.912052e-01 1.591112e-01
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5
## Edu2.FM -5.607472e-06 0.0006713951 -3.412027e-05 -2.736326e-04 -0.0022935252
## Labo.FM 2.331945e-07 -0.0002819357 5.302884e-04 -4.692578e-03 0.0022190154
## Life.Exp -2.815823e-04 0.0283150248 1.294971e-02 -6.752684e-02 0.9865644425
## Edu.Exp -9.562910e-05 0.0075529759 1.427664e-02 -3.313505e-02 0.1431180282
## GNI -9.999832e-01 -0.0057723054 -5.156742e-04 4.932889e-05 -0.0001135863
## Mat.Mor 5.655734e-03 -0.9916320120 1.260302e-01 -6.100534e-03 0.0266373214
## Ado.Birth 1.233961e-03 -0.1255502723 -9.918113e-01 5.301595e-03 0.0188618600
## Parli.F -5.526460e-05 0.0032317269 -7.398331e-03 -9.971232e-01 -0.0716401914
## PC6 PC7 PC8
## Edu2.FM 2.180183e-02 6.998623e-01 7.139410e-01
## Labo.FM 3.264423e-02 7.132267e-01 -7.001533e-01
## Life.Exp -1.453515e-01 5.380452e-03 2.281723e-03
## Edu.Exp 9.882477e-01 -3.826887e-02 7.776451e-03
## GNI -2.711698e-05 -8.075191e-07 -1.176762e-06
## Mat.Mor 1.695203e-03 1.355518e-04 8.371934e-04
## Ado.Birth 1.273198e-02 -8.641234e-05 -1.707885e-04
## Parli.F -2.309896e-02 -2.642548e-03 2.680113e-03
summary(pca_human)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000 1.0000
PC1 explains 99.99% of the variability in the data set; the remaining components together contribute less than 0.1%.
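The proportions reported by summary(pca_human) follow directly from the component standard deviations: each component's variance divided by the total variance. A sketch using the standard deviations printed above:

```r
# Standard deviations of the 8 principal components, copied from the output above
sdev <- c(1.854416e+04, 1.855219e+02, 2.518701e+01, 1.145441e+01,
          3.766241e+00, 1.565912e+00, 1.912052e-01, 1.591112e-01)
prop_var <- sdev^2 / sum(sdev^2)   # proportion of variance per component
round(prop_var[1], 4)              # 0.9999 -- PC1 dominates completely
```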
biplot_1 <- biplot(pca_human, choices = 1:2,
cex = c(0.8, 1),
col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
In the biplot, red labels are variables and grey labels are observations
(countries). GNI lies far from the origin (0,0) along the x-axis (PC1),
indicating its dominant contribution to PC1. Most countries cluster
tightly around the origin, which indicates that they are not well
represented on the factor map.
Standardize the variables in the human data and repeat the above analysis. Interpret the results of both analysis (with and without standardizing). Are the results different? Why or why not? Include captions (brief descriptions) in your plots where you describe the results by using not just your variable names, but the actual phenomena they relate to.
human_std <- scale(human_1)
summary(human_std)
## Edu2.FM Labo.FM Life.Exp Edu.Exp
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7188 Min. :-2.7378
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6425 1st Qu.:-0.6782
## Median : 0.3503 Median : 0.2316 Median : 0.3056 Median : 0.1140
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.6717 3rd Qu.: 0.7126
## Max. : 2.6646 Max. : 1.6632 Max. : 1.4218 Max. : 2.4730
## GNI Mat.Mor Ado.Birth Parli.F
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
pca_human_std <- prcomp(human_std)
summary(pca_human_std)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.071 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.536 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion 0.536 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
## PC8
## Standard deviation 0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion 1.00000
After standardizing, PC1 explains 53.6% of the variability in the data set and PC2 a further 16.2%; together they explain about 70% of the variability.
biplot_2 <- biplot(pca_human_std, choices = 1:2,
cex = c(0.8, 1),
col = c("grey40", "deeppink2"))
After standardizing the data set, the row and column points are much better scattered across the plot, and all the variables are visualized on comparable scales. The spread-out country names indicate that the observations are now well represented in the factor map.
Give your personal interpretations of the first two principal component dimensions based on the biplot drawn after PCA on the standardized human data.
Based on the findings above, it is clear that PCA on the standardized data set produces results that are much better suited for analysis. The likely explanation is that variables on very different scales make pairs of features hard to compare. PCA is computed from covariances, and unlike correlation, which is dimensionless, covariance carries units obtained by multiplying the units of the two variables. When the data set is not scaled, the variables are therefore not directly comparable, and each variable tends to load almost exclusively on one component because no other variable has a comparable value range. This is consistent with the fact that the only two variables with smaller loadings, Edu2.FM and Labo.FM, happen to share a similar range from 0 to 1. Covariance also gives some variables extreme leverage: in the summary of the raw human data, GNI has a scale vastly larger than the other variables, which inflates its covariances with every other variable and leads to its over-contribution to the factor solution. This misrepresentation of the data explains the poor quality of the raw-data biplot, where most countries cluster tightly together — showing that the PCA has not produced a factor map with acceptable dissimilarity among the rows — and where GNI is the only variable drawn at a visible distance while the others overlap heavily around the center.
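The scale-sensitivity argument above can be demonstrated on synthetic data: prcomp() on raw data works from the covariance matrix, so a variable with a huge range (like GNI) dominates PC1, while prcomp(..., scale. = TRUE) works from the dimensionless correlation matrix. A hypothetical sketch — the variables a and b are made up for illustration:

```r
set.seed(1)
X <- data.frame(a = rnorm(100),           # unit-scale variable
                b = rnorm(100) * 1000)    # huge-scale variable, like GNI
pca_raw <- prcomp(X)                      # covariance-based: b dominates PC1
pca_std <- prcomp(X, scale. = TRUE)       # correlation-based: balanced loadings
round(summary(pca_raw)$importance["Proportion of Variance", ], 3)
round(summary(pca_std)$importance["Proportion of Variance", ], 3)
```

On the raw data PC1 carries essentially all the variance; after scaling, the two independent variables share it roughly equally — the same contrast seen between the two biplots above.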
The tea data comes from the FactoMineR package and it is measured with a questionnaire on tea: 300 individuals were asked how they drink tea (18 questions) and what are their product’s perception (12 questions). In addition, some personal details were asked (4 questions).
# Load the data
library(FactoMineR)
data(tea)
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
# Convert character variables to factors
# According to the structure above, all character columns are already stored as factors, so no conversion is needed.
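Had some of the columns been plain character vectors, a generic conversion sketch would look like this (df is a made-up toy data frame):

```r
# Toy data frame with one character column and one numeric column
df <- data.frame(x = c("a", "b", "a"), y = 1:3, stringsAsFactors = FALSE)
is_chr <- sapply(df, is.character)           # find the character columns
df[is_chr] <- lapply(df[is_chr], as.factor)  # convert them in place
str(df)                                      # x is now a factor, y is untouched
```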
Use Multiple Correspondence Analysis (MCA) on the tea data (or on just certain columns of the data, it is up to you!). Interpret the results of the MCA and draw at least the variable biplot of the analysis. You can also explore other plotting options for MCA. Comment on the output of the plots.
# Select columns
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- dplyr::select(tea, all_of(keep_columns)) # all_of() replaces the superseded one_of()
summary(tea_time)
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## where lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
# Visualize the dataset
library(ggplot2)
pivot_longer(tea_time, cols = everything()) %>%
ggplot(aes(value)) + facet_wrap("name", scales = "free")+
geom_bar()+
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
# MCA
mca <- MCA(tea_time, graph = FALSE)
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.279 0.261 0.219 0.189 0.177 0.156 0.144
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519 7.841
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953 77.794
## Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.141 0.117 0.087 0.062
## % of var. 7.705 6.392 4.724 3.385
## Cumulative % of var. 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139 0.003
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626 0.027
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111 0.107
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841 0.127
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979 0.035
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990 0.020
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347 0.102
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459 0.161
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968 0.478
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898 0.141
## v.test Dim.3 ctr cos2 v.test
## black 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 2.867 | 0.433 9.160 0.338 10.053 |
## green -5.669 | -0.108 0.098 0.001 -0.659 |
## alone -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 3.226 | 1.329 14.771 0.218 8.081 |
## milk 2.422 | 0.013 0.003 0.000 0.116 |
## other 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
# Visualize MCA
plot(mca, invisible = c("ind"), graph.type = "classic", habillage = "quali")
### 5.3 Interpretation
Multiple Correspondence Analysis (MCA) is a multivariate statistical technique used for analyzing the relationships between categorical variables. MCA results are often presented in terms of principal components.
Eigenvalues in the summary represent the amount of variance captured by each dimension; higher eigenvalues indicate more important dimensions. The two most important dimensions explain 15.238% and 14.232% of the variance, respectively.
The plot is a variable biplot showing how the categories of the six variables disperse along the first two dimensions. Categories close to each other are positively associated, while those far apart are negatively associated, so the relationships between categories can be read from their proximity. The position of a category along a dimension also indicates its contribution to that dimension. For example, sugar and No.sugar contribute little to both dimensions, while tea shop and unpackaged contribute strongly to the first dimension.
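The interpretations above can also be checked numerically. A minimal sketch (assuming the `mca` object fitted with FactoMineR above) for inspecting the eigenvalues and the category contributions directly:

```r
# Assumes `mca` was fitted with FactoMineR::MCA() as above
library(FactoMineR)

# Eigenvalues and explained variance per dimension
head(mca$eig)

# Contribution (%) of each category to dimensions 1 and 2;
# large values mark the categories that drive the biplot axes
round(mca$var$contrib[, 1:2], 2)

# If the factoextra package is installed, it offers a polished biplot:
# library(factoextra)
# fviz_mca_var(mca, repel = TRUE)
```

The contribution table is the numeric counterpart of reading distances off the biplot: `unpackaged` and `tea shop` should show the largest values on dimension 1.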
date()
## [1] "Mon Dec 11 10:27:36 2023"
1. Implement the analyses of Chapter 8 of MABS, using the R codes of Exercise Set 6: Meet and Repeat: PART I but using the RATS data (from Chapter 9 and Meet and Repeat: PART II).
2. Implement the analyses of Chapter 9 of MABS, using the R codes of Exercise Set 6: Meet and Repeat: PART II, but using the BPRS data (from Chapter 8 and Meet and Repeat: PART I).
library(tidyverse)
# Load the data from the assignment of data wrangling, including BPRS data and RATS data
source("meet_and_repeat.r")
## 'data.frame': 40 obs. of 11 variables:
## $ treatment: int 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : int 1 2 3 4 5 6 7 8 9 10 ...
## $ week0 : int 42 58 54 55 72 48 71 30 41 57 ...
## $ week1 : int 36 68 55 77 75 43 61 36 43 51 ...
## $ week2 : int 36 61 41 49 72 41 47 38 39 51 ...
## $ week3 : int 43 55 38 54 65 38 30 38 35 55 ...
## $ week4 : int 41 43 43 56 50 36 27 31 28 53 ...
## $ week5 : int 40 34 28 50 39 29 40 26 22 43 ...
## $ week6 : int 38 28 29 47 32 33 30 26 20 43 ...
## $ week7 : int 47 28 25 42 38 27 31 25 23 39 ...
## $ week8 : int 51 28 24 46 32 25 31 24 21 32 ...
## 'data.frame': 16 obs. of 13 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Group: int 1 1 1 1 1 1 1 1 2 2 ...
## $ WD1 : int 240 225 245 260 255 260 275 245 410 405 ...
## $ WD8 : int 250 230 250 255 260 265 275 255 415 420 ...
## $ WD15 : int 255 230 250 255 255 270 260 260 425 430 ...
## $ WD22 : int 260 232 255 265 270 275 270 268 428 440 ...
## $ WD29 : int 262 240 262 265 270 275 273 270 438 448 ...
## $ WD36 : int 258 240 265 268 273 277 274 265 443 460 ...
## $ WD43 : int 266 243 267 270 274 278 276 265 442 458 ...
## $ WD44 : int 266 244 267 272 273 278 271 267 446 464 ...
## $ WD50 : int 265 238 264 274 276 284 282 273 456 475 ...
## $ WD57 : int 272 247 268 273 278 279 281 274 468 484 ...
## $ WD64 : int 278 245 269 275 280 281 284 278 478 496 ...
## Rows: 360
## Columns: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ weeks <chr> "week0", "week0", "week0", "week0", "week0", "week0", "week0…
## $ bprs <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, …
## $ week <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## Rows: 176
## Columns: 5
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3,…
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1, …
## $ WD <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", …
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470…
## $ time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, …
## tibble [360 × 5] (S3: tbl_df/tbl/data.frame)
## $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ weeks : chr [1:360] "week0" "week0" "week0" "week0" ...
## $ bprs : int [1:360] 42 58 54 55 72 48 71 30 41 57 ...
## $ week : int [1:360] 0 0 0 0 0 0 0 0 0 0 ...
## tibble [176 × 5] (S3: tbl_df/tbl/data.frame)
## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
## $ WD : chr [1:176] "WD1" "WD1" "WD1" "WD1" ...
## $ Weight: int [1:176] 240 225 245 260 255 260 275 245 410 405 ...
## $ time : int [1:176] 1 1 1 1 1 1 1 1 1 1 ...
# Check the content of the data
names(RATS)
## [1] "ID" "Group" "WD1" "WD8" "WD15" "WD22" "WD29" "WD36" "WD43"
## [10] "WD44" "WD50" "WD57" "WD64"
str(RATS)
## 'data.frame': 16 obs. of 13 variables:
## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Group: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
## $ WD1 : int 240 225 245 260 255 260 275 245 410 405 ...
## $ WD8 : int 250 230 250 255 260 265 275 255 415 420 ...
## $ WD15 : int 255 230 250 255 255 270 260 260 425 430 ...
## $ WD22 : int 260 232 255 265 270 275 270 268 428 440 ...
## $ WD29 : int 262 240 262 265 270 275 273 270 438 448 ...
## $ WD36 : int 258 240 265 268 273 277 274 265 443 460 ...
## $ WD43 : int 266 243 267 270 274 278 276 265 442 458 ...
## $ WD44 : int 266 244 267 272 273 278 271 267 446 464 ...
## $ WD50 : int 265 238 264 274 276 284 282 273 456 475 ...
## $ WD57 : int 272 247 268 273 278 279 281 274 468 484 ...
## $ WD64 : int 278 245 269 275 280 281 284 278 478 496 ...
summary(RATS)
## ID Group WD1 WD8 WD15
## 1 : 1 1:8 Min. :225.0 Min. :230.0 Min. :230.0
## 2 : 1 2:4 1st Qu.:252.5 1st Qu.:255.0 1st Qu.:255.0
## 3 : 1 3:4 Median :340.0 Median :345.0 Median :347.5
## 4 : 1 Mean :365.9 Mean :369.1 Mean :372.5
## 5 : 1 3rd Qu.:480.0 3rd Qu.:476.2 3rd Qu.:486.2
## 6 : 1 Max. :555.0 Max. :560.0 Max. :565.0
## (Other):10
## WD22 WD29 WD36 WD43
## Min. :232.0 Min. :240.0 Min. :240.0 Min. :243.0
## 1st Qu.:267.2 1st Qu.:268.8 1st Qu.:267.2 1st Qu.:269.2
## Median :351.5 Median :356.5 Median :360.0 Median :360.0
## Mean :379.2 Mean :383.9 Mean :387.0 Mean :386.0
## 3rd Qu.:492.5 3rd Qu.:497.8 3rd Qu.:504.2 3rd Qu.:501.0
## Max. :580.0 Max. :590.0 Max. :597.0 Max. :595.0
##
## WD44 WD50 WD57 WD64
## Min. :244.0 Min. :238.0 Min. :247.0 Min. :245.0
## 1st Qu.:270.0 1st Qu.:273.8 1st Qu.:273.8 1st Qu.:278.0
## Median :362.0 Median :370.0 Median :373.5 Median :378.0
## Mean :388.3 Mean :394.6 Mean :398.6 Mean :404.1
## 3rd Qu.:510.5 3rd Qu.:516.0 3rd Qu.:524.5 3rd Qu.:530.8
## Max. :595.0 Max. :612.0 Max. :618.0 Max. :628.0
##
names(RATSL)
## [1] "ID" "Group" "WD" "Weight" "time"
str(RATSL)
## tibble [176 × 5] (S3: tbl_df/tbl/data.frame)
## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
## $ WD : chr [1:176] "WD1" "WD1" "WD1" "WD1" ...
## $ Weight: int [1:176] 240 225 245 260 255 260 275 245 410 405 ...
## $ time : int [1:176] 1 1 1 1 1 1 1 1 1 1 ...
summary(RATSL)
## ID Group WD Weight time
## 1 : 11 1:88 Length:176 Min. :225.0 Min. : 1.00
## 2 : 11 2:44 Class :character 1st Qu.:267.0 1st Qu.:15.00
## 3 : 11 3:44 Mode :character Median :344.5 Median :36.00
## 4 : 11 Mean :384.5 Mean :33.55
## 5 : 11 3rd Qu.:511.2 3rd Qu.:50.00
## 6 : 11 Max. :628.0 Max. :64.00
## (Other):110
Description of the RATS data: This nutrition study includes 16 rats, identified by ID 1 to 16 and divided into three treatment groups: Group 1 has eight rats, while Groups 2 and 3 have four rats each. Each rat was measured on 11 occasions, giving 176 (16 × 11) observations in the long data. The three groups were put on different diets, and each rat’s body weight (grams) was recorded approximately weekly (except in week seven, when two recordings were taken) over a 9-week period. The question of most interest is whether the growth profiles of the three groups differ.
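As a quick sanity check of the design described above (a sketch, assuming the `RATSL` long data loaded by the wrangling script):

```r
# Number of measurement occasions per rat
n_times <- length(unique(RATSL$time))        # 11 occasions

# Rats per group: each rat contributes n_times rows in the long data
table(RATSL$Group) / n_times                 # 8, 4, 4 rats

# Total rows: 16 rats x 11 occasions
nrow(RATSL)                                  # 176
```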
# Plot the weights of all observations over time, differentiating the three treatment groups
library(ggplot2)
RATSL |>
ggplot(aes(x = time, y = Weight, group = ID, color = Group))+
geom_line()+
geom_point()+
labs(title = "Change of weight in three treatment groups",
x = "Time (days)",
y = "Weight (grams)")+
theme(plot.title = element_text(size = 12, face = "bold"),
panel.background = element_rect(fill = "white",
color = "black"),
panel.grid.major = element_line(color = "grey", size = 0.2),
panel.grid.minor = element_line(color = "grey", size = 0.2),
strip.background = element_rect(color = "black",#adjust the strips aes
fill = "steelblue"),
strip.text = element_text(size =10,
color = "white"),
legend.position = "none")+
facet_wrap(~Group)
## Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Findings: 1. All individuals gained weight over time. 2. The rats in Group 1 are far lighter than those in Groups 2 and 3, which have similar weight levels.
To reduce the influence of the starting weight when tracking changes over time, we can standardize the observed weights at each time point, i.e., \[standardised(x) = \frac{x - mean(x)}{sd(x)}\]
# Standardise the variable weight
RATSL <- RATSL |>
group_by(time) |>
mutate(stdwgt = (Weight - mean(Weight))/sd(Weight)) |>
ungroup()
glimpse(RATSL)
## Rows: 176
## Columns: 6
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3,…
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1, …
## $ WD <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", …
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470…
## $ time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, …
## $ stdwgt <dbl> -1.0011429, -1.1203857, -0.9613953, -0.8421525, -0.8819001, -0.…
# Plot again with the standardised weight
RATSL |>
ggplot(aes(x = time, y = stdwgt, group = ID, color = Group))+
geom_line()+
geom_point()+
labs(title = "Change of standardised weight in three treatment groups",
x = "Time (days)",
y = "Standardized weight")+
theme(plot.title = element_text(size = 12, face = "bold"),
panel.background = element_rect(fill = "white",
color = "black"),
panel.grid.major = element_line(color = "grey", size = 0.2),
panel.grid.minor = element_line(color = "grey", size = 0.2),
strip.background = element_rect(color = "black",#adjust the strips aes
fill = "steelblue"),
strip.text = element_text(size =10,
color = "white"),
legend.position = "none")+
facet_wrap(~Group)
Findings: After standardization, the weight changes over time are much less pronounced for the individual rats.
In addition to displaying the individual profiles, another approach is to show the mean profile for each treatment group, along with the standard error of the mean as an indication of the variation at each time point: \[se = \frac{sd(x)}{\sqrt{n}}\]
# Summary data with mean and standard error of rats by group and time
rats.group <- RATSL |>
group_by(Group, time) |>
summarise(mean = mean(Weight),se = sd(Weight)/sqrt(n())) |>
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
glimpse(rats.group)
## Rows: 33
## Columns: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ time <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29, 36, …
## $ mean <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.375, 2…
## $ se <dbl> 5.381640, 4.629100, 4.057346, 4.808614, 3.909409, 4.166190, 3.87…
# Plot the mean profiles of the groups over the time
library(ggplot2)
rats.group |>
ggplot(aes(x = time,
y = mean,
shape = Group,
color = Group))+
geom_line()+
geom_point(size=3)+
theme(legend.position = c(0.9,0.5),
panel.background = element_rect(fill = "white",
color = "black"),
panel.grid = element_line(color = "grey",
size = 0.1),
axis.text = element_text(size = 10),
axis.title = element_text (size = 13),
plot.title = element_text(size = 15,
face = "bold")) +
labs(title = "Change of average weight of three groups over the time",
x = "Time(days)",
y = "Average(grams)")+
scale_color_manual(values = c("wheat4", "steelblue", "darkred"))
# Add error bar (mean±2se) to the plot
## Create an object that saves dodge position so that point and line dodge simultaneously (for preventing overlap)
dodgeposition <- position_dodge(width = 0.3)
rats.group |>
ggplot(aes(x = time,
y = mean,
shape = Group,
color = Group))+
geom_line(position = dodgeposition) + #dodge to avoid overlap
geom_point(size=3, position = dodgeposition) +#dodge to avoid overlap
geom_errorbar(aes(ymin=mean-2*se, ymax=mean+2*se),
width=0.5, #set width of error bar
position =dodgeposition) +#dodge to avoid overlap
theme(legend.position = c(0.9,0.5),
panel.background = element_rect(fill = "white",
color = "black"),
panel.grid = element_line(color = "grey",
size = 0.1),
axis.text = element_text(size = 10),
axis.title = element_text (size = 13),
plot.title = element_text(size = 15,
face = "bold")) +
labs(title = "Change of weight statistics (mean±2se) of three groups over time",
x = "Time(days)",
y = "Average weight±2se(grams)")+
scale_color_manual(values = c("wheat4", "steelblue", "darkred"))
Findings: 1. The average weight of each group increased with a mild slope over time. 2. The average weights rank Group 3 > Group 2 > Group 1. 3. Since the error bars of Group 1 never overlap with those of Groups 2 and 3, the groups differed greatly in weight from the very outset and maintained this difference throughout the study.
# Create a summary data by groups and subject with mean as the summary variable (ignoring baseline WD1, "RATSL_nobase")
library(dplyr)
RATSL_nobase <- RATSL |>
filter(time > 1) |>
group_by(Group, ID) |>
summarise(mean=mean(Weight) ) |>
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
glimpse(RATSL_nobase)
## Rows: 16
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean <dbl> 263.2, 238.9, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, 443.9, 4…
# Draw a boxplot of the mean versus groups
library(ggplot2)
RATSL_nobase |>
ggplot(aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "red") +
scale_y_continuous(name = "mean(weight(g)), time 8-64")
# Find and filter the outliers
## Create a function that detect outliers
is_outlier <- function(x) {
return(x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x))
}
## Mutate a new variable "outlier" to tag outlier weight
ratsl_outl <- RATSL_nobase|>
group_by(Group) |>
mutate(outlier = ifelse(is_outlier(mean), ID, as.factor(NA))) #create outlier label
## Find the outlier in the boxplot
ratsl_outl |>
ggplot(aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "red") +
scale_y_continuous(name = "mean(weight(g)), time 8-64")+
geom_text(aes(label = outlier), na.rm = TRUE, hjust = -0.3)
## Filter out the outlier ("RATSL_nobase1") and adjust the ggplot code to draw the plot again with the new data
RATSL_nobase1 <- ratsl_outl |>
filter(is.na(outlier))
ggplot(RATSL_nobase1, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "red") +
scale_y_continuous(name = "mean(weight(g)), time 8-64")
After removing the outliers, the IQRs are much narrower than before.
Next, apply ANOVA to assess differences among the three treatment groups, using the data without the outliers.
anova <- aov(mean ~ Group, data = RATSL_nobase1)
summary(anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 2 176917 88458 2836 1.69e-14 ***
## Residuals 10 312 31
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value < 0.05 indicates that the mean weights differ across the three groups. To find out which pair(s) of groups differ significantly, apply Tukey’s HSD test for pairwise comparisons between groups.
pairT <- TukeyHSD(anova)
print(pairT)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = mean ~ Group, data = RATSL_nobase1)
##
## $Group
## diff lwr upr p adj
## 2-1 183.64286 173.07885 194.20686 0
## 3-1 269.50952 258.94552 280.07353 0
## 3-2 85.86667 73.36717 98.36617 0
There are statistically significant differences (p < 0.0001) between each pair of groups. However, we do not know whether these differences arise only from the treatments, or also from differences already present at baseline. A linear regression with the baseline weight added as a covariate can help assess the influence of the baseline on the group differences.
# Add the baseline from the original data as a new variable to the summary data
RATSL_baseline <- inner_join(RATSL_nobase1, RATS, by = c("Group","ID")) |>
select(Group, ID, mean, WD1) |>
mutate(baseline = WD1)
# Fit the linear model with the mean as the response
fit <- lm(mean ~ Group + baseline, data = RATSL_baseline)
summary(fit)
##
## Call:
## lm(formula = mean ~ Group + baseline, data = RATSL_baseline)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.6341 -2.8915 0.1102 2.0096 7.8989
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 221.3094 28.3120 7.817 2.66e-05 ***
## Group2 152.7218 18.7452 8.147 1.91e-05 ***
## Group3 219.6183 29.9107 7.342 4.36e-05 ***
## baseline 0.1866 0.1111 1.680 0.127
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.136 on 9 degrees of freedom
## Multiple R-squared: 0.9987, Adjusted R-squared: 0.9982
## F-statistic: 2236 on 3 and 9 DF, p-value: 3.048e-13
# Compute the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 2 176917 88458 3353.2062 1.181e-13 ***
## baseline 1 74 74 2.8219 0.1273
## Residuals 9 237 26
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The baseline has no significant effect on the mean weight (p = 0.127). After adjusting for the baseline, the mean differences across groups remain significant (p < 0.05).
# Check the content of the data
names(BPRS)
## [1] "treatment" "subject" "week0" "week1" "week2" "week3"
## [7] "week4" "week5" "week6" "week7" "week8"
str(BPRS)
## 'data.frame': 40 obs. of 11 variables:
## $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ week0 : int 42 58 54 55 72 48 71 30 41 57 ...
## $ week1 : int 36 68 55 77 75 43 61 36 43 51 ...
## $ week2 : int 36 61 41 49 72 41 47 38 39 51 ...
## $ week3 : int 43 55 38 54 65 38 30 38 35 55 ...
## $ week4 : int 41 43 43 56 50 36 27 31 28 53 ...
## $ week5 : int 40 34 28 50 39 29 40 26 22 43 ...
## $ week6 : int 38 28 29 47 32 33 30 26 20 43 ...
## $ week7 : int 47 28 25 42 38 27 31 25 23 39 ...
## $ week8 : int 51 28 24 46 32 25 31 24 21 32 ...
summary(BPRS)
## treatment subject week0 week1 week2
## 1:20 1 : 2 Min. :24.00 Min. :23.00 Min. :26.0
## 2:20 2 : 2 1st Qu.:38.00 1st Qu.:35.00 1st Qu.:32.0
## 3 : 2 Median :46.00 Median :41.00 Median :38.0
## 4 : 2 Mean :48.00 Mean :46.33 Mean :41.7
## 5 : 2 3rd Qu.:58.25 3rd Qu.:54.25 3rd Qu.:49.0
## 6 : 2 Max. :78.00 Max. :95.00 Max. :75.0
## (Other):28
## week3 week4 week5 week6 week7
## Min. :24.00 Min. :20.00 Min. :20.00 Min. :19.00 Min. :18.0
## 1st Qu.:29.75 1st Qu.:28.00 1st Qu.:26.00 1st Qu.:22.75 1st Qu.:23.0
## Median :36.50 Median :34.50 Median :30.50 Median :28.50 Median :30.0
## Mean :39.15 Mean :36.35 Mean :32.55 Mean :31.23 Mean :32.2
## 3rd Qu.:44.50 3rd Qu.:43.00 3rd Qu.:38.00 3rd Qu.:37.00 3rd Qu.:38.0
## Max. :76.00 Max. :66.00 Max. :64.00 Max. :64.00 Max. :62.0
##
## week8
## Min. :20.00
## 1st Qu.:22.75
## Median :28.00
## Mean :31.43
## 3rd Qu.:35.25
## Max. :75.00
##
names(BPRSL)
## [1] "treatment" "subject" "weeks" "bprs" "week"
str(BPRSL)
## tibble [360 × 5] (S3: tbl_df/tbl/data.frame)
## $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ weeks : chr [1:360] "week0" "week0" "week0" "week0" ...
## $ bprs : int [1:360] 42 58 54 55 72 48 71 30 41 57 ...
## $ week : int [1:360] 0 0 0 0 0 0 0 0 0 0 ...
summary(BPRSL)
## treatment subject weeks bprs week
## 1:180 1 : 18 Length:360 Min. :18.00 Min. :0
## 2:180 2 : 18 Class :character 1st Qu.:27.00 1st Qu.:2
## 3 : 18 Mode :character Median :35.00 Median :4
## 4 : 18 Mean :37.66 Mean :4
## 5 : 18 3rd Qu.:43.00 3rd Qu.:6
## 6 : 18 Max. :95.00 Max. :8
## (Other):252
Description of the BPRS data: This study includes 40 male subjects divided into two treatment groups of 20, with subjects numbered 1 to 20 within each group. Each subject was rated on the brief psychiatric rating scale (BPRS) before treatment began (week 0) and then at weekly intervals for eight weeks. The BPRS assesses 18 symptom constructs such as hostility, suspiciousness, hallucinations and grandiosity; each is rated from one (not present) to seven (extremely severe). The scale is used to evaluate patients suspected of having schizophrenia. This study applies linear mixed effects models to study the effects of the two treatments on the BPRS rating over time, taking into account the variability across individuals within the groups.
To begin, plot the BPRS ratings of the individuals in both treatment groups over the weeks, ignoring for now the longitudinal (repeated-measures) structure of the data.
library(ggplot2)
ggplot(BPRSL, aes(x = week, y = bprs, group = subject, color = subject)) +
geom_line()+
facet_wrap(~treatment)+
theme(legend.position = "none",
panel.grid = element_line(color = "grey", size = 0.1),
panel.background = element_rect(color = "black",
fill = "white"),
strip.background = element_rect(color = "black",
fill = "steelblue"),
strip.text = element_text(color = "white",
face = "bold",
size = 10),
axis.title = element_text(size = 12),
axis.text = element_text(size = 10))+
labs(title = "Two treatment effects on BPRS over week by individuals",
x = "Time (weeks)",
y = "BPRS rating")
Findings: 1. The BPRS ratings of the participants decreased in both treatment groups over time. 2. The trajectories and starting points differed greatly across individuals. 3. It is not straightforward to see which treatment is better.
Continuing to ignore the repeated-measures structure of the data, fit a multiple linear regression model with the BPRS rating as the response and time (weeks) and treatment as explanatory variables.
BPRS_reg <- lm(bprs ~ week + treatment, data = BPRSL)
summary(BPRS_reg)
##
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
confint(BPRS_reg)
## 2.5 % 97.5 %
## (Intercept) 43.765472 49.142306
## week -2.766799 -1.774035
## treatment2 -1.991083 3.135527
Judging from the p-values, time (week) is a significant predictor of the BPRS rating, while treatment has no significant influence on the rating. For each additional week, the rating decreased on average by 2.27 points (95% CI −2.77 to −1.77).
The previous model assumes independence of the repeated measures of the rating, which is highly unlikely. For a more formal analysis of the BPRS data, we first fit a random intercept model with the same two explanatory variables, time (weeks) and treatment. A random intercept model allows each subject a different rating baseline, referred to as the random intercept for that individual.
However, in the BPRSL data the subject numbers 1 to 20 are reused in both treatment arms, so the same subject number refers to two different individuals. This would cause problems in the mixed-effects modeling, since subject enters the model as a grouping factor, so we first need to recode the subject numbers to be unique.
library(dplyr)
BPRSL_new <- BPRSL |>
mutate(subject = as.numeric(subject)) |>
mutate(subject = ifelse(treatment ==2, subject + 20, subject))
glimpse(BPRSL_new)
## Rows: 360
## Columns: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ subject <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ weeks <chr> "week0", "week0", "week0", "week0", "week0", "week0", "week0…
## $ bprs <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, …
## $ week <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
Now fit the random intercept model.
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL_new, REML = FALSE)
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
## Data: BPRSL_new
##
## AIC BIC logLik deviance df.resid
## 2582.9 2602.3 -1286.5 2572.9 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.27506 -0.59909 -0.06104 0.44226 3.15835
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 97.39 9.869
## Residual 54.23 7.364
## Number of obs: 360, groups: subject, 40
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.3521 19.750
## week -2.2704 0.1503 -15.104
## treatment2 0.5722 3.2159 0.178
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.256
## treatment2 -0.684 0.000
confint(BPRS_ref)
## Computing profile confidence intervals ...
## 2.5 % 97.5 %
## .sig01 7.918218 12.645375
## .sigma 6.828332 7.973573
## (Intercept) 41.744598 51.163179
## week -2.565919 -1.974914
## treatment2 -5.885196 7.029641
The fixed-effects part of the model summary shows the same coefficients as the multiple linear regression above, so we focus on the random-effects part. On average (see Std.Dev. of subject), the BPRS rating shifts by about 9.87 points (95% CI: 7.92 to 12.65) when moving from one participant to another. In fact, the influence of individual differences is even larger than the effect of time, the significant predictor found in the fixed-effects model (coefficient −2.27, 95% CI: −2.57 to −1.97).
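The weight of the between-subject variation can be summarized with the intraclass correlation (ICC), the share of total variance attributable to differences between subjects. A minimal sketch using the variance components reported above (this computation is my addition, not part of the original exercise):

```r
# Variance components from the random intercept model summary above
var_subject  <- 97.39   # between-subject (random intercept) variance
var_residual <- 54.23   # residual (within-subject) variance

# ICC: proportion of total variance due to subjects
icc <- var_subject / (var_subject + var_residual)
round(icc, 2)  # about 0.64

# Equivalently, extracted from the fitted lme4 model object:
# vc <- as.data.frame(VarCorr(BPRS_ref))
# vc$vcov[1] / sum(vc$vcov)
```

An ICC of roughly 0.64 means nearly two thirds of the variance in the ratings is between-person, which backs up the point that individual differences dominate.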
Fitting a random intercept and random slope model allows the regression fits for individuals to differ not only in intercept but also in slope. This makes it possible to account for individual differences in the BPRS rating profiles as well as the effects of time and treatment.
BPRS_refboth <- lmer(bprs ~ week + treatment + (treatment | subject),
data = BPRSL_new,
REML = FALSE)
summary(BPRS_refboth)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (treatment | subject)
## Data: BPRSL_new
##
## AIC BIC logLik deviance df.resid
## 2584.8 2612.0 -1285.4 2570.8 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.3066 -0.6047 -0.0635 0.4402 3.1691
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.43 8.027
## treatment2 57.38 7.575 0.07
## Residual 54.23 7.364
## Number of obs: 360, groups: subject, 40
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 1.9708 23.571
## week -2.2704 0.1503 -15.104
## treatment2 0.5722 3.2159 0.178
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.305
## treatment2 -0.556 0.000
We focus again on the random-effects part. On average, the BPRS rating shifts by about 8.03 points (see Std.Dev. of subject) when moving from one participant to another, lower than in the random intercept model. The standard deviation of the random treatment effect is 7.58, so even though the fixed treatment effect is not significant, the effect of treatment can vary considerably from individual to individual. This is an important finding, as it sheds some light on why the two treatments do not show significantly different average effects.
In short, individual differences have a great effect on the BPRS rating, even larger than the effects of time and treatment, and they also strongly affect how individuals respond to the different treatments.
Finally, fit a random intercept and slope model that allows for a week × treatment interaction.
BPRS_interact <- lmer(bprs ~ week * treatment + (treatment | subject),
data = BPRSL_new,
REML = FALSE)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: large eigenvalue ratio
## - Rescale variables?
summary(BPRS_interact)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week * treatment + (treatment | subject)
## Data: BPRSL_new
##
## AIC BIC logLik deviance df.resid
## 2581.0 2612.1 -1282.5 2565.0 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.2801 -0.6048 -0.0810 0.4414 3.3943
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.53 8.033
## treatment2 71.91 8.480 -0.04
## Residual 53.27 7.298
## Number of obs: 360, groups: subject, 40
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.0573 23.275
## week -2.6283 0.2107 -12.475
## treatment2 -2.2911 3.4296 -0.668
## week:treatment2 0.7158 0.2980 2.402
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.410
## treatment2 -0.600 0.246
## wek:trtmnt2 0.290 -0.707 -0.348
## optimizer (nloptwrap) convergence code: 0 (OK)
## Model is nearly unidentifiable: large eigenvalue ratio
## - Rescale variables?
# Perform an ANOVA test on the two models
anova(BPRS_refboth, BPRS_interact)
## Data: BPRSL_new
## Models:
## BPRS_refboth: bprs ~ week + treatment + (treatment | subject)
## BPRS_interact: bprs ~ week * treatment + (treatment | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_refboth 7 2584.8 2612.0 -1285.4 2570.8
## BPRS_interact 8 2581.0 2612.1 -1282.5 2565.0 5.7203 1 0.01677 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Pay attention to the likelihood ratio test chi-squared value and the corresponding p-value: a significant result means the larger model fits better than the comparison model. Here the test (χ² = 5.72, p = 0.017), together with the lower AIC, shows that the week × treatment interaction significantly improves the fit.
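A natural last step would be to compare observed and model-implied profiles. A sketch (assuming the `BPRS_interact` model and `BPRSL_new` data fitted above) that adds the fitted values to the long data and plots them next to the observed ratings:

```r
library(dplyr)
library(ggplot2)

# Attach fitted BPRS values from the interaction model to the long data
BPRSL_fit <- BPRSL_new |>
  mutate(fitted = fitted(BPRS_interact))

# Observed (solid grey) vs fitted (dashed blue) profiles by treatment group
ggplot(BPRSL_fit, aes(x = week, group = subject)) +
  geom_line(aes(y = bprs), alpha = 0.4) +
  geom_line(aes(y = fitted), linetype = "dashed", color = "steelblue") +
  facet_wrap(~treatment) +
  labs(title = "Observed vs fitted BPRS profiles",
       x = "Time (weeks)", y = "BPRS rating")
```

If the model captures the data well, the dashed fitted lines should track the overall downward trend of each subject while smoothing out the week-to-week noise.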